Abstract: This paper extends the performance analysis of a
controlled  database  unit  studied  in  Wu,  Metzler,  and 
Linderman  (2005)  to  include  the  cases  where  errors  and  delays 
can  occur  in  state-based  control  actions  as  a  result  of 
uncertainty  in  the  knowledge  of  the  system  state.  The  paper 
details  the  way  such  errors  and  delays  are  captured  through 
augmenting  the  state  space  in  the  Markov  model  of  the  database 
unit.  State  variable  feedback  is  used  to  activate  the  process  of 
restoration  upon  the  failure  of  one  of  the  database  servers  in  the 
unit.  The  performance  of  the  database  is  evaluated  in  terms  of 
the  resulting  mean  time  to  unit  failure,  the  steady  state 
availability,  the  expected  response  time,  and  the  service 
overhead  of  the  database  unit.  All  performance  measures  are 
examined  with  respect  to  the  likelihood  of  decision  error  and 
the  amount  of  control  action  delay. 


I.  Introduction 

A recent effort to install and test monitoring tools and to
increase the level of redundancy in critical subsystems
in air operation centers has provided opportunities for vast
performance improvement in their command and control
supporting systems. Our previous work on a controlled
processing unit [1] demonstrated that reduced response
time to service requests and shortened periods of system
unavailability, both resulting from automated monitoring and
control, can significantly raise the probability of attaining the
desired outcome in an air operation. A more recent study by
Wu, Metzler, and Linderman [2] on the database unit shown
in Fig. 1 further revealed the benefits of a conscientious
design of redundant architecture and the application of
supervisory control, measured in terms of the
mean time to unit failure, the steady-state availability, the
expected response time, and the service overhead of the
database unit.

To assess the performance in a quantified manner, both
the processing unit [1] and the database unit (Fig. 1) [2] were
given the interpretation of a queuing network [3], [4] with
specific sets of operating policies and structural parameters.
The control authorities considered included the ability to
restore the first failed server and the ability to route service
requests. In order to obtain an analytic model of manageable
size for scrutinizing the effects of supervisory control, the
queuing network was restricted to the closed type [3], [4]. In
addition, all the event lifetime distributions were assumed to
be exponential. A simulation study was conducted by James
Metzler et al. [5] using Arena [7], [8] with all the above
restrictions removed.


Fig. 1 A partitioned database unit

An underlying assumption of the existing study is that the
state information in the queuing network model of a given
unit is known exactly at any given time. In reality, however,
it is not practical to monitor every state variable. As a result,
the knowledge of a certain set of states is inferred from
the observables. In addition, control actions are likely to be
required at the time of a state transition, such as the
occurrence of a component failure, in which case a process
of diagnosis must take place before a state-based control
action. The time required for diagnosis can be random, and
the outcome of the diagnosis can be uncertain. The
objectives of this paper, therefore, are to seek ways to
incorporate the effects of decision errors and control
action delays into the Markov model of a queuing network,
and to use the model to assess the impact of such errors and
delays on the performance of the database unit in Fig. 1.

The  paper  is  organized  as  follows.  Section  II  describes  the 
baseline model of the controlled database unit in Fig. 1.
Section  III  discusses  our  approaches  to  modeling  the  effects 
of  control  delays  and  decision  errors.  Section  IV  presents  the 
results  of  performance  evaluation  parameterized  with  respect 
to  the  amount  of  control  action  delay  and  the  probability  of 
error. 

II.  Baseline  model  for  a  controlled  database  unit 

The description of the baseline model, i.e., the model that
does not include decision errors and control delays, follows
to a large extent that of Wu, Metzler, and Linderman [2]. The
database unit in Fig. 1 contains three servers in parallel to
answer three classes (A, B, C) of queries for which relevant
information can be found in the partitioned sets A, B, C of
the database, respectively. Server $S_{AB}$ contains database class
A as the primary class and database class B as the secondary
class. Server $S_{BC}$ contains database class B as the primary
class and database class C as the secondary class. Server $S_{CA}$
contains database class C as the primary class and database
class A as the secondary class. The failure of a server implies
the loss of both classes of data within the server. A system
level failure is declared when two servers fail, in which case
one class of data is completely lost. The queues preceding
servers $S_{AB}$, $S_{BC}$, and $S_{CA}$ are named $Q_{AB}$, $Q_{BC}$, and $Q_{CA}$,
respectively. All queues are of sufficient capacity. Service is
provided on an FCFS basis at each server.

The three delay elements of average delay $1/\lambda$ imply that
there are always three customers present in the unit at any
given time. A new query is generated at a delay element
upon the completion of the service to a query at one of the
servers. The delay elements are also intended to reflect
the response time to the querying customers by other
service nodes in the system that are not explicitly modeled.
Any new query is assumed to be equally likely to seek
database class A, B, or C. Therefore routing probabilities
$p_{AB}$, $p_{BC}$, and $p_{CA}$ are assigned the same value.

The use of a queuing network model for the database is
based on its suitability to involve control actions and to
capture their effects on the system performance. The model
is built in this study on the premise that event life
distributions have been established for the process of query
generation ($\exp(\lambda)$, with distribution function $1 - e^{-\lambda t}$),
the process of service completion ($\exp(\mu)$), the process of
server failure ($\exp(\nu)$), the process of data restoration
($\exp(\gamma)$), and the process of unit overhaul ($\exp(\omega)$) when
the failed database unit is repaired. All such processes are independent.
Standard statistical methods that involve data collection,
parameter estimation, and goodness of fit tests exist for
identifying event life distributions. Since all event lives are
assumed to be exponentially distributed, the database unit
can be conveniently modeled as a Markov chain specified by
a state space $\mathcal{X}$, an initial state probability mass function
(pmf) $\pi_x(0)$, and a set of state transition rates $\Lambda$ [9], [10]. The
reader uninterested in the details of model building can
advance to the paragraph right above Equation (1).

1) State space $\mathcal{X}$

A state name is coded with a 6-digit number indicative of
all queue lengths and server states in the unit. With some
abuse of notation, a valid state representation is given by
$x = Q_{AB}Q_{BC}Q_{CA}S_{AB}S_{BC}S_{CA}$, where queue lengths $Q_{AB}, Q_{BC}, Q_{CA}
\in \{0, 1, 2, 3\}$ with total length $L = Q_{AB} + Q_{BC} + Q_{CA} \le 3$, and
server states $S_{AB}, S_{BC}, S_{CA} \in \{0, 1, 2\}$. Server state "2" = data
are lost in both the primary and the secondary classes in a
server, "1" = the data in the primary class have been restored
and data in the secondary class have not been restored, and
"0" = data in both primary class and secondary class in a
server are intact. A server is said to be in the down state if it
is either at state "1" or at state "2". For example, state
110020 indicates that server $S_{AB}$ is up with one customer in
its queue, server $S_{BC}$ is down with both classes of data gone
and one customer in its queue, and server $S_{CA}$ is up and idle.
Note that the queue length includes the customer being
served. There are 540 valid states in the baseline system. The
total number of states is reduced to 141 when all the states of
system level failure are aggregated. A set of alternative state
names is assigned from $\mathcal{X} = \{1, 2, \ldots, 141\}$ with 000000
mapped to $x = 1$ and the aggregated system failure state
mapped to $x = 141$.
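
The state counts quoted above can be checked by brute-force enumeration. The following is a minimal sketch, not the authors' code; the function names and the tuple encoding of a state are illustrative assumptions. It also counts the operational states in which exactly one server has suffered a total data loss, which matches the 60 states that Section III later adds to the baseline model.

```python
from itertools import product

QUEUE_LEVELS = range(4)    # each queue length is 0..3
SERVER_STATES = range(3)   # 0 = intact, 1 = primary restored only, 2 = all data lost
MAX_CUSTOMERS = 3          # closed network with three customers


def enumerate_states():
    """Return all valid states as tuples (Q_AB, Q_BC, Q_CA, S_AB, S_BC, S_CA)."""
    states = []
    for queues in product(QUEUE_LEVELS, repeat=3):
        if sum(queues) > MAX_CUSTOMERS:
            continue  # violates L = Q_AB + Q_BC + Q_CA <= 3
        for servers in product(SERVER_STATES, repeat=3):
            states.append(queues + servers)
    return states


def is_system_failure(state):
    """A system level failure is declared when two or more servers are down."""
    return sum(1 for s in state[3:] if s != 0) >= 2


all_states = enumerate_states()
operational = [x for x in all_states if not is_system_failure(x)]
aggregated_count = len(operational) + 1  # +1 for the single lumped failure state
single_total_loss = [x for x in operational if x[3:].count(2) == 1]

print(len(all_states), aggregated_count, len(single_total_loss))
```

The example state 110020 from the text corresponds to the tuple (1, 1, 0, 0, 2, 0) in this encoding.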

2) Initial state pmf $\{\pi_x(0),\ x = 1, 2, \ldots, 141\}$

It is assumed that the database unit starts operation from
state $x = 1$, i.e., the initial state probability is given by vector
$\pi(0) = [1\ 0\ \cdots\ 0]$. When overhaul is considered at the
occurrence of a system level failure, all customers are
flushed out to the delay elements. Once the database unit is
renewed and ready for operation again, it starts at the same
initial state $x = 1$, and a renewal process [10] is formed.

3) Set of state transition functions $p_{ij}(t)$

Events that trigger the transitions and the corresponding
transition rates are given as follows. A newly generated
query enters one of the servers with rate $(3-L)\lambda/3$. A
query is answered at a server with rate $\mu$. A complete data
loss occurs at a server with rate $\nu$. Data in the primary data
class of a server are restored with rate $\gamma_p u_1$, and data in the
secondary data class of a server are restored with rate $\gamma_s$,
where $u_1$ authorizes whether to restore the lost data for the
primary class. Finally, the failed database unit is renewed
with rate $\omega u_3$, where $u_3$ decides whether to repair the failed
system.

Let $X(t) \in \mathcal{X}$ denote the random state variable at time $t$. The
set of state transition functions

$$p_{ij}(t) = P[X(t) = j \mid X(0) = i], \quad i, j = 1, 2, \ldots, 141 \tag{1}$$

for the continuous-time Markov chain can be solved from the
forward Chapman-Kolmogorov equation [7]

$$\dot{P}(t) = P(t)\,Q(u_1, u_3), \quad P(0) = I, \quad P(t) = [p_{ij}(t)], \tag{2}$$

where $Q(u_1, u_3)$ is called an infinitesimal generator or a rate
transition matrix whose $(i,j)$th entry is given by the rate
associated with the transition from current state $i$ to next state
$j$ in the rate transition table. The state probability mass function
at time $t$

$$\pi(t) = [\pi_1(t)\ \ \pi_2(t)\ \ \cdots\ \ \pi_{141}(t)], \quad t \ge 0 \tag{3}$$

is computed by

$$\pi(t) = \pi(0)P(t). \tag{4}$$
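
To make (2) through (4) concrete, the sketch below evaluates $\pi(t) = \pi(0)e^{Qt}$ for a deliberately small stand-in chain. Everything in it is an illustrative assumption rather than the paper's 141-state model: the three states (all servers up, one server down, unit failed), the rate values, and the use of uniformization as the numerical method, which is one standard way to evaluate the matrix exponential and is not stated in the source.

```python
import math


def build_generator(nu, gamma, omega, u1, u3):
    """Toy generator Q(u1, u3): 0 = all up, 1 = one server down, 2 = unit failed."""
    return [
        [-3 * nu, 3 * nu, 0.0],
        [gamma * u1, -(gamma * u1 + 2 * nu), 2 * nu],
        [omega * u3, 0.0, -omega * u3],
    ]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def transient_pmf(pi0, Q, t, tol=1e-12, max_terms=10_000):
    """Evaluate pi(t) = pi(0) exp(Qt) by uniformization."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n)) or 1.0   # uniformization rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                    # Poisson probability of 0 jumps
    term = list(pi0)                             # pi(0) P^k, starting at k = 0
    result = [weight * x for x in term]
    accumulated, k = weight, 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        term = vec_mat(term, P)
        weight *= q * t / k                      # Poisson probability of k jumps
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, term)]
    return result


PI0 = [1.0, 0.0, 0.0]
for u1 in (0, 1):
    Q = build_generator(nu=0.01, gamma=1.0, omega=0.5, u1=u1, u3=0)
    print(u1, transient_pmf(PI0, Q, t=10.0))
```

With $u_3 = 0$ the failed state is absorbing, and switching restoration on ($u_1 = 1$) lowers the probability of being in that state at a given time, which is the sense in which the control actions shape the state probabilities.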

At this point a baseline Markov model for the database
unit of Fig. 1 has been established. Since transition rate
matrix $Q$ is dependent on control actions, the state transition
functions $p_{ij}(t)$ are being controlled, and so are the state
probabilities.

B. Restoration and overhaul

Our  ultimate  goal  is  to  eliminate  all  single  point  failures, 
and  to  mitigate  the  effects  of  a  single  server  failure  on  the 
performance  of  the  database  unit.  Our  approach  is  to  base  the 
supervisory  control  actions  on  the  state  information,  which 
effectively  alter  the  transition  rates  when  loss  of  data  occurs 
in  a  single  server. 

Taking into consideration the symmetry of the model, the
control policy is described only for the case of a failed server
$S_{AB}$. The control policies considered for this study are
summarized as follows.

$$u_1 = \begin{cases} 0, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{serves (no restoration)} \\ 1, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{restores class A data} \end{cases} \tag{5}$$

The presence of supervisory control in the transition rate
matrix is seen via $u_1$, $u_3$, $1-u_1$, and $1-u_3$. The values of $u_1$ and $u_3$
represent specific control actions associated with data
restoration and unit overhaul, respectively. Unit overhaul
occurs only at the unit failure state 141.

The complete baseline model is provided in [2] in the
form of a rate transition table, where an additional control
variable $u_2$ was present. The variable $u_2$ controls routing
probabilities when data loss occurs in a server. It is removed in
this paper because the small number of queries in the system makes
the additional benefit afforded by routing control less obvious to
observe.


III. Model augmentation to include errors & delays

This section focuses on modeling the effects of decision
errors and control action delays upon entering a state. These
two undesirable effects can be intertwined. To quantify their
individual impact on performance, they are separated into the
class of decision errors, when a control action is taken
incorrectly but immediately upon entering a state, and the
class of delayed control actions, when a correct control action
is taken but after some time delay. In addition, there are
deterministically diagnosable systems for which the only cost
of diagnosis is time [9]. Two augmented models will be
generated in this section, representing a controlled database
unit with decision error and one with control action delay,
respectively. Each model will contain 201 states.

A. Effect of decision error

The supervisory control considered in this study is state
information-based. Upon entering a state, say, A, any
information deficiency can result in uncertainty in decision
making as to whether to take a control action or what control
actions to take. In this case, every decision carries a risk.

An example of a decision error with the database unit
would be that upon a server failure a wrong server is
identified as having failed. More specifically, suppose $S_{AB}$
has failed but $S_{CA}$ is mistakenly thought to
be the failed one. Based on the false information, the control
action would be for $S_{BC}$ to restore data class C in $S_{CA}$,
whereas $S_{AB}$ would be expected to continue to work. As a
consequence of the wrong decision, none of the servers can
process queries for a period of time. The database unit is said
to have entered an intermittent error state. It is assumed that
from this state, only transitions to more server failures, or to
the recovery to the original destination state, can occur. Fig. 2
depicts a generalized representation of such a case.

Without loss of generality, let A be a state that is entered
upon a total data loss in a server. Let C be the state entered
upon the completion of primary database restoration
associated with the data loss. Let $B_1$ through $B_n$ be the states
representing completions of services at the other $n$ servers. Let
the $G_i$'s be the states entered upon the arrival of a new
query in one of the server queues. Let $F_1$ through $F_m$ be the
states entered upon data loss at the other $m$ servers. The notion
of intermittent state I is introduced, as shown in Fig. 2, to
allow the representation of imperfect decision making upon
entering A. Therefore, there is an intermittent error state for
each state that involves outgoing transitions with weakened
control authorities due to some decision errors. In the
database unit of Fig. 1, altogether 60 states are added to the
original 141-state baseline model. Note that the states $G_i$ are
not shown explicitly in Fig. 2, and they can be regarded as
part of the $F_i$'s from this point on. It is assumed that once the
primary database restoration takes place for a particular
server, the secondary restoration is error free.

Fig. 2 Decision error modeling with an intermittent error state


Let $\lambda_{AC}$ denote the transition rate from state A to state C,
i.e., to restoration of the primary database associated with the
most recent data loss, in the absence of decision error. Let $u$ be
the probability of successful restoration given that the event
of restoration occurs. The factor $(1-u)$ is then referred to as the
thinning [9] of the Poisson arrival process associated with the
restoration. The split of rate $\lambda_{AC}$ into rate $u\lambda_{AC}$ and rate
$(1-u)\lambda_{AC}$ is sometimes also called a decomposition [10] of a
Poisson arrival process into type 1 with probability $u$ and
type 2 with probability $(1-u)$.

An imperfect decision corresponds to the value of $u$ being
less than unity. As a consequence, the authority of
supervisory control that is supposed to reinforce the
restoration process has been weakened. The smaller the
value of $u$, the weaker the control authority is.

The rate of recovery from decision error is denoted by $r_c$.
To state the fact that recovery from an intermittent error state
to restoration cannot be faster than the error-free ($u = 1$)
restoration process, $r_c \le \lambda_{AC}$ is enforced. On the other
hand, the outgoing transition rates from the intermittent error
state to the states of data loss in other servers, i.e., from I to
$F_i$, $i = 1, 2, \ldots, m$, are bounded below by the corresponding
rates going from A to $F_i$. These transitions further reduce the
likelihood of reaching state C.

It is now shown that decision errors always degrade the
performance in terms of the state transition probability $P_{AC}$,
which is the probability that restoration to state C occurs
given that the state is A. It turns out that this probability is
readily obtained for a Markov chain [9]:

$$P_{AC} = \frac{u\lambda_{AC}}{\Lambda(A)}, \tag{6}$$


where

$$\Lambda(A) = \lambda_{AB_1} + \cdots + \lambda_{AB_n} + \lambda_{AF_1} + \cdots + \lambda_{AF_m} + \lambda_{AC} \tag{7}$$

without decision error, in which case $u = 1$ in (6), and

$$\Lambda(A) = \lambda_{AB_1} + \cdots + \lambda_{AB_n} + \lambda_{AF_1} + \cdots + \lambda_{AF_m} + u\lambda_{AC} + (1-u)\lambda_{AC} \tag{8}$$

with decision error, in which case $u < 1$. The denominators
given by (7) and (8) are the same. Evidently, (6) is proportional to
$u$, and is the largest at $u = 1$ when there is no decision error.
On the other hand, flow balance at state I yields


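$$\dot{\pi}_I = (1-u)\lambda_{AC}\,\pi_A - \Big(\sum_{i=1}^{m} \lambda_{I,F_i} + r_c\Big)\pi_I, \tag{9}$$

from which the following expression for $\pi_I(t)$ in terms
of $\pi_A(t)$ at steady state is obtained:

$$\pi_I(\infty) = \frac{(1-u)\lambda_{AC}}{\sum_{i=1}^{m} \lambda_{I,F_i} + r_c}\,\pi_A(\infty). \tag{10}$$

Expression (10) is proportional to $1-u$.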
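
The two proportionality statements can be illustrated numerically. The sketch below simply evaluates (6), with the denominator written as in (8), and the ratio in (10). The rate values and the choice of two service-completion states and two failure states are hypothetical, picked only for illustration.

```python
def restoration_probability(u, lam_ac, other_rates):
    """P_AC of (6): u * lam_AC over the total outgoing rate of state A, as in (8)."""
    total_rate = sum(other_rates) + u * lam_ac + (1.0 - u) * lam_ac
    return u * lam_ac / total_rate


def error_state_ratio(u, lam_ac, rates_i_to_f, r_c):
    """pi_I(inf) / pi_A(inf) of (10)."""
    return (1.0 - u) * lam_ac / (sum(rates_i_to_f) + r_c)


# Hypothetical rates, chosen only for illustration.
LAM_AC = 1.0                          # error-free restoration rate A -> C
OTHER_RATES = [0.5, 0.5, 0.02, 0.02]  # A -> B_1, B_2 and A -> F_1, F_2
RATES_I_TO_F = [0.02, 0.02]           # I -> F_1, F_2
R_C = LAM_AC                          # most optimistic recovery, r_c = lam_AC

for u in (1.0, 0.9, 0.5):
    p_ac = restoration_probability(u, LAM_AC, OTHER_RATES)
    ratio = error_state_ratio(u, LAM_AC, RATES_I_TO_F, R_C)
    print(f"u={u:.1f}  P_AC={p_ac:.4f}  pi_I/pi_A={ratio:.4f}")
```

Halving $u$ halves $P_{AC}$, and the share of time spent in the intermittent error state relative to state A grows linearly in $1-u$.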

Some  results  of  numerical  calculation  will  be  presented  in 
Section  IV  based  on  the  state-augmented  model  of  the 
database  unit  of  Fig.  1  that  show  how  certain  performance 
measures  depend  on  the  probability  of  the  restoration 
decision  error. 


B.  Effect  of  delayed  control  actions 

Time  required  for  diagnosis  can  be  regarded  as  the 
universal  cause  of  a  control  action  delay.  Time  delay  can  be 
traded  off  in  some  applications  with  the  decision  error  to 
minimize  their  combined  effects.  This  subsection  focuses  on 
the  discussion  of  the  effect  of  time  delay  alone. 

An example of control action delay with the database
unit of Fig. 1 would be that a total loss of data on a server is
not immediately observed. As a result, the action of data
restoration is delayed.

As in the previous subsection, let A be a state that is
entered upon a total loss of data in a server. Let C be the
state entered upon the completion of primary database
restoration associated with the data loss. States $B_1$ through
$B_n$ and states $F_1$ through $F_m$ also follow the earlier
definitions. Fig. 3 depicts a proposed model capable of
describing a restoration action that is delayed, upon entering
state A, by an exponentially distributed random amount of time
with mean $\delta^{-1}$.

In a more general case, there can be an $N$-phased delay
implemented in the augmented model by inserting $N$ states
$D_1$ through $D_N$ in series between states A and C. Each state
$D_i$ retains outgoing transitions to all $B_1$ through $B_n$ and $F_1$
through $F_m$, in addition to the transition to $D_{i+1}$. The total
amount of delay before the restoration action is bounded below
by the random variable $D = D_1 + \cdots + D_N$, with a generalized
Erlang distribution [10] whose density $f_D$ has the Laplace transform

$$\mathcal{L}[f_D](s) = \prod_{i=1}^{N} \frac{\delta_i}{s + \delta_i}. \tag{11}$$

One may use an $N$-stage Erlang to approach a constant delay,
or an $N$-stage hyper-exponential to approach a highly
uncertain delay, or a mixture of the two to acquire more
general properties [9].
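
The remark that an $N$-stage Erlang approaches a constant delay can be seen from the first two moments of a sum of independent exponential stages: the mean is $\sum_i 1/\delta_i$ and the variance is $\sum_i 1/\delta_i^2$. The sketch below considers the delay in isolation, ignoring the competing transitions out of each $D_i$; the function names and the mean delay value are illustrative assumptions.

```python
def delay_moments(stage_rates):
    """Mean and variance of a sum of independent exponential stages."""
    mean = sum(1.0 / rate for rate in stage_rates)
    variance = sum(1.0 / rate ** 2 for rate in stage_rates)
    return mean, variance


def erlang_stage_rates(mean_delay, n_stages):
    """N identical stage rates whose total mean delay equals mean_delay."""
    return [n_stages / mean_delay] * n_stages


MEAN_DELAY = 2.0   # hypothetical average control action delay

for n in (1, 4, 16, 64):
    mean, variance = delay_moments(erlang_stage_rates(MEAN_DELAY, n))
    scv = variance / mean ** 2   # squared coefficient of variation
    print(f"N={n:3d}  mean={mean:.3f}  variance={variance:.4f}  scv={scv:.4f}")
```

The mean stays fixed while the squared coefficient of variation falls as $1/N$, so the delay concentrates around its mean as stages are added, at the cost of $N$ extra states per delayed transition.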
Fig. 3 Control delay modeling with a single-stage delay state


Note that there are two significant differences between the
decision error model of Fig. 2 and the control delay model of
Fig. 3. First, the link to restoration of the primary database is
present in Fig. 2 with a smaller likelihood of transition,
whereas the link to restoration without delay is absent in
Fig. 3. Second, all links to service completion are absent
in Fig. 2, but present in Fig. 3. Therefore, the two cases
are of a different nature.

With  a  single-stage  delay  for  each  state  entered  upon  a 
total  loss  of  data  in  a  server,  60  states  are  added  to  the 
baseline  model.  Numerical  results  on  the  effect  of  control 
action  delay  will  be  presented  in  the  next  section. 


IV.  Performance  analysis  and  discussion 
A.  Time  to  system  failure 

When $u_3 = 0$, the augmented Markov chain model for the
database unit contains one absorbing state $x = 201$ at which
the chain remains forever once it is entered. This is the state
of system level failure. The remaining 200 states are transient
states. Decompose the state probability vector

$$\pi(t) = [\,\underbrace{\pi_T(t)}_{1 \times 200}\ \ \underbrace{\pi_a(t)}_{1 \times 1}\,], \tag{12}$$

where vector $\pi_T(t)$ contains the transient state probabilities,
and $\pi_a(t)$ is the absorbing state probability. Decomposing the
rate transition matrix $Q$ and the state transition function
matrix $P(t)$ solved from (2) accordingly yields

$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ 0 & 0 \end{bmatrix}, \quad P(t) = \begin{bmatrix} P_{11}(t) & P_{12}(t) \\ 0 & 1 \end{bmatrix}. \tag{13}$$


From (2), (4), and (12), it can be determined that the
probability density function of time to system failure, or time
to absorption, is given by

$$\dot{\pi}_a(t) = \pi_T(0)P_{11}(t)Q_{12}, \quad \pi_a(0) = 0, \tag{14}$$

where

$$\pi_T(0) = [1\ 0\ \cdots], \quad P_{11}(t) = e^{Q_{11}t}. \tag{15}$$

In addition, the mean time to failure of the database unit can
be shown to be [9]

$$\mathrm{MTTF} = -\pi_T(0)\,Q_{11}^{-1}\,\mathbf{1}^T, \quad \mathbf{1}^T = [1\ \cdots\ 1]^T. \tag{16}$$

Fig. 4 below shows the dependence of mean time to
failure of the database unit on the probability of correct control
action for data restoration, with restoration rate $\gamma$ as a
parameter. The plot indicates that MTTF is sensitive to
restoration rate, and becomes more sensitive to supervisory
control coverage at a higher restoration rate. The relative
robustness of MTTF with respect to supervisory control
coverage can be attributed to the fact that recovery has taken
a most optimistic path, with $r_c = \lambda_{AC}$, after a decision error
has been made.


Fig. 4 Unit MTTF versus control coverage


Fig. 5  Unit  MTTF  versus  control  delay 
Fig. 5 above shows the dependence of mean time to failure
of the database unit on expected control action delay for data
restoration, with restoration rate $\gamma$ as a parameter. It is
expected that control action delay affects MTTF more
drastically when the restoration rate is high. Control action delay
becomes dominant in how long it takes to restore data when
it becomes comparable to the average time required to perform
data restoration.

B.  Steady-state  availability 

Suppose as soon as the database unit reaches a system
level failure, an overhaul process starts with all the
customers flushed out to the delay elements. Suppose the unit
is repaired with rate $\omega$. At the completion of the repair to
condition $\pi(0)$, the unit immediately starts to operate again.
In this case $u_3$ is set to 1 in the model, whereas it is set to 0 in
the case of an absorbing chain. The existence of a unique
steady-state distribution of the Markov chain when $u_3 = 1$ is
guaranteed if the chain is irreducible (or ergodic) [10]. The
steady-state availability, which can be roughly thought of as
the fraction of time the database unit is up, is given by

$$A_{sys} = 1 - \pi_{201}(\infty), \tag{17}$$

where $\pi_{201}(\infty)$ is determined by solving

$$\pi(\infty)Q = 0 \quad \text{and} \quad \sum_{i=1}^{201} \pi_i(\infty) = 1. \tag{18}$$
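
A small numerical illustration of (17) and (18) follows. It uses a three-state stand-in chain (all servers up, one server down, unit under overhaul) with hypothetical rates, and finds the stationary vector by power iteration on the uniformized chain. That iteration is a substitute chosen to keep the sketch dependency-free; the source does not say how (18) was solved.

```python
def steady_state(Q, tol=1e-13, max_iter=1_000_000):
    """Approximate the solution of pi Q = 0, sum(pi) = 1 by power iteration."""
    n = len(Q)
    q = 1.05 * max(-Q[i][i] for i in range(n))   # slack keeps self-loops, hence aperiodicity
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def toy_generator(nu, gamma, omega, u1):
    """Irreducible toy chain with overhaul on (u3 = 1): 0 = up, 1 = one down, 2 = failed."""
    return [
        [-3 * nu, 3 * nu, 0.0],
        [gamma * u1, -(gamma * u1 + 2 * nu), 2 * nu],
        [omega, 0.0, -omega],
    ]


def availability(Q):
    """A_sys = 1 - probability of the failed state, here the last state."""
    return 1.0 - steady_state(Q)[-1]


avail_without = availability(toy_generator(0.01, 1.0, 0.5, u1=0))
avail_with = availability(toy_generator(0.01, 1.0, 0.5, u1=1))
print(avail_without, avail_with)
```

For a chain of the paper's size a direct linear solve of (18) would normally be preferred; the iteration is used here only for brevity.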


Fig.6  Steady-state  availability  versus  control  coverage 


Fig.7  Steady-state  availability  versus  control  delay 


Fig. 6 and Fig. 7 show the steady-state availability as a
function of supervisory control coverage and as a function of
expected control action delay, respectively. It can be seen that
both long delays and slow restoration reduce the availability to
unacceptable levels. Explanations of the insensitivity of the
availability with respect to coverage and delay under slow
restoration conditions follow those for Fig. 4 and Fig. 5.

C.  Response  time 

Consider again the irreducible chain studied in the previous
section. Let $I_{ij}$ be the indicator function associated with
transition from state $i$ to state $j$, and $q_{ij}$ be the corresponding
entry in transition rate matrix $Q$. Let $N_i$ be the total number
of queries in queue at state $i$. Then the total expected number
of queries in queue at steady state is given by

$$E[X] = \sum_{i=1}^{201} \pi_i(\infty) N_i, \tag{19}$$

and the arrival rate at steady state is

$$\bar{\lambda} = \sum_{i=1}^{201} \sum_{j=1}^{201} I_{ij} q_{ij}. \tag{20}$$

The calculation of the response time at steady state then
follows Little's Theorem, $E[X] = \bar{\lambda} E[T]$ [4], where $E[T]$
denotes the expected response time.
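
The response time calculation can be sketched on a much smaller closed system: three customers, each generating a query at rate $\lambda$ after its previous one is answered, and a single server with rate $\mu$. This toy model, its rates, and the birth-death shortcut for the stationary pmf are assumptions for illustration. In particular, the sketch weights each arrival transition rate by the steady-state probability of its source state, which is an interpretation of how the arrival rate is evaluated rather than something the extracted form of (20) shows.

```python
def birth_death_steady_state(birth, death):
    """Stationary pmf of a birth-death chain from its product-form weights."""
    weights = [1.0]
    for b, d in zip(birth, death):
        weights.append(weights[-1] * b / d)
    total = sum(weights)
    return [w / total for w in weights]


LAM, MU, CUSTOMERS = 0.5, 2.0, 3   # hypothetical rates and population

# State i = number of queries queued or in service at the single server.
birth = [(CUSTOMERS - i) * LAM for i in range(CUSTOMERS)]  # arrivals out of state i
death = [MU] * CUSTOMERS                                   # service completions

pi = birth_death_steady_state(birth, death)

# Counterpart of (19): expected number of queries in queue.
expected_in_queue = sum(p * i for i, p in enumerate(pi))

# Steady-state arrival rate, weighting arrival transitions by pi_i (assumption).
arrival_rate = sum(p * (CUSTOMERS - i) * LAM for i, p in enumerate(pi))

# Little's Theorem: E[X] = arrival_rate * E[T].
response_time = expected_in_queue / arrival_rate
print(expected_in_queue, arrival_rate, response_time)
```

Two consistency checks are available for this toy case: the arrival rate must equal the throughput $\mu(1 - \pi_0)$, and the closed-network form of Little's law requires the arrival rate times the sum of the response time and $1/\lambda$ to equal the population of three.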

Fig. 8 and Fig. 9 show the average response time as a
function of supervisory control coverage and as a function of
control action delay, respectively. Unlike the other
performance measures, the sensitivity of the average
response time remains relatively significant at a low
restoration rate.


Fig. 8 Average query response time versus control coverage

Fig. 9 Average query response time versus supervisory control delay
D.  Overhead 

Overhead is a quantity introduced to reflect the ratio of the
time invested in helping the database unit to survive longer
to its overall busy time. It is a measure of the cost of
supervisory control. More specifically,

$$\theta = \frac{\Pr[\text{unit restores or fails} \mid \text{unit is not failed}]}{\Pr[\text{unit restores or fails or serves} \mid \text{unit is not failed}]}. \tag{21}$$

Overhead $\theta$ is calculated for the irreducible chain ($u_3 = 1$)
as a function of supervisory control coverage and as a function
of supervisory control delay. These are shown in Fig. 10 and
Fig. 11. As in the case of availability, overhead at steady
state becomes unacceptably high at a low restoration rate. It is
also sensitive to control coverage and delay when the restoration
rate is high.


Fig. 10 Service overhead versus control coverage

Fig. 11 Service overhead versus control delay